Objective / Problem Statement

We have a data set about red wine quality. We are going to look into this data set's variables, try to find similarities between wines according to their quality, and build a predictive model of wine quality. I used some techniques that were not covered during the certificate; I learnt them in college earlier in my life and thought it would be nice to consolidate everything here!

Exploratory analysis

Building the data set

The table has 1599 observations, each one a different red wine, and 12 variables.

Data

For each wine, we have its characteristics (its acidity, its pH, its alcohol content…) as well as its quality.
The first 11 variables are numerical and give the characteristics of each wine.
The last variable, qualiteY, is different from the others: it takes the value 1 if the wine is of good quality, 0 otherwise.

This variable can be used to build an explanatory model from the first 11 variables, the explanatory variables.

Missing values

There are no missing values in this dataset.

Proportion of the target variable

The two classes of qualiteY are balanced.

Variables distribution

Most of the variables are approximately normally distributed, except for residual.sugar and chlorides, which are right-skewed (most values sit on the left, with a long tail to the right).

Variables correlation

After studying the density of the variables, I want to study the relationships between the explanatory variables. I am trying to find out whether some variables are correlated, meaning whether there is a link between two or more variables such that their values always move:

  • in the same direction, in the case of a positive correlation
  • in opposite directions, in the case of a negative correlation

Correlation matrix (Pearson)

  • There is a strong negative correlation between pH and fixed.acidity, between citric.acid and volatile.acidity, and between pH and citric.acid.
  • There is a positive correlation between fixed.acidity, citric.acid and density.

Graphical representation

Let’s look in more depth at the links between these variables to understand the nature of the correlation.

We can see there is a linear relation between these different variables. Let’s now look at the links between these explanatory variables and the variable to explain: qualiteY.

Variable relations to the quality of the wine

We are trying to understand if there is an influence from the variables on wine quality.

We plotted the distribution of these 11 explanatory variables depending on the quality of the wine. Comparing the distributions of each variable for qualiteY = 0 and qualiteY = 1, we can see visually that the distributions differ strongly for 4 explanatory variables: volatile.acidity, citric.acid, sulphates and alcohol.

Thus, we are showing only these 4 variables below:

It looks like these variables have an influence on the quality of a wine.

To prove this intuition statistically, we performed a test of means for each explanatory variable, between qualiteY = 0 and qualiteY = 1. The null hypothesis \(H_0\) of the means test is that the means are equal.

Let’s start, for instance, with volatile.acidity: the test will allow us to check whether the mean of volatile.acidity for wines of quality 0 is equal to the mean of volatile.acidity for wines of quality 1.

We fix the significance threshold at 5%. The null hypothesis will be rejected if the p-value is below 5%.
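The tests in this report were run in R; purely as a hedged illustration, the same Welch test of means can be sketched in Python with scipy, on simulated values standing in for volatile.acidity:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated stand-ins for volatile.acidity in bad (0) and good (1) wines
va_bad = rng.normal(0.58, 0.18, size=744)
va_good = rng.normal(0.41, 0.15, size=855)

# Welch two-sample t-test: H0 is that the two means are equal
t_stat, p_value = stats.ttest_ind(va_bad, va_good, equal_var=False)

# Reject H0 at the 5% threshold when the p-value is below 0.05
reject_h0 = p_value < 0.05
print(reject_h0)
```

The group sizes and means above are illustrative, not the real sample values.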

By applying the means test on these 4 variables, here is what we conclude:

The good quality wines have:

  • a lower level of volatile acidity than bad wines, because the p-value is 1.7 × 10^{-39} < 5%.

  • a higher level of citric acid than bad wines, because the p-value is 10^{-10} < 5%.

  • a higher level of sulphates than bad wines, because the p-value is 10^{-18} < 5%.

  • a higher level of alcohol than bad wines, because the p-value is 10^{-77} < 5%.

To go further in the analysis, we are going to perform a principal component analysis (PCA), which will allow us to see whether we can identify groups of individuals and variables by reducing the dimensionality.

Multidimensional descriptive statistics: PCA

Principal component analysis makes it possible to analyse and visualise a dataset in which individuals are described by multiple quantitative variables. It is therefore possible to study the similarities between individuals across all variables and to understand individual profiles by reducing the dimensionality.

I run a PCA on this data set to see whether there is a combination of these 11 explanatory variables that can explain wine quality.

Analysis of eigenvalues

The variance represents the information within a dataset. The idea is to reduce the number of dimensions without losing too much information. We choose to keep about 70% of the information from the data set, which reduces the number of dimensions from 11 to 4.

##         percentage of variance cumulative percentage of variance
## comp 1              28.1739313                          28.17393
## comp 2              17.5082699                          45.68220
## comp 3              14.0958499                          59.77805
## comp 4              11.0293866                          70.80744
## comp 5               8.7208370                          79.52827
## comp 6               5.9964388                          85.52471
## comp 7               5.3071929                          90.83191
## comp 8               3.8450609                          94.67697
## comp 9               3.1331102                          97.81008
## comp 10              1.6484833                          99.45856
## comp 11              0.5414392                         100.00000
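The eigenvalue table above comes from the R analysis. A minimal Python sketch of the same idea, on simulated standardized data, shows how the number of components retaining at least 70% of the variance can be found:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Simulated matrix standing in for the 11 explanatory variables
X = rng.normal(size=(1599, 11))
X[:, 1] += 0.8 * X[:, 0]          # inject some correlation between columns
X_std = StandardScaler().fit_transform(X)

pca = PCA().fit(X_std)
cum_var = np.cumsum(pca.explained_variance_ratio_) * 100

# Number of components needed to keep at least 70% of the variance
n_components = int(np.searchsorted(cum_var, 70) + 1)
print(n_components)
```

With the real wine data, the table above shows this criterion is met at 4 components (70.8%).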

Scree plot

There is no clear elbow on the scree plot, apart from the drop between the first and second dimensions. We will stick with the analysis of the first 4 axes.

Variable analysis

With a PCA, each axis is a linear combination of the variables.

Axes 1 and 2

The variance explained by the first 2 axes is 45%.

  • AXIS 1: Axis 1 represents wine acidity. It opposes two highly correlated variables (citric.acid and fixed.acidity) to the pH. An acidic wine will have a low pH for a high fixed.acidity value. We already saw in the correlation matrix that these variables were negatively correlated.

  • AXIS 2: Axis 2 represents the sulfur dioxide in the wine (free.sulfur.dioxide and total.sulfur.dioxide, positively correlated). These variables are negatively correlated with alcohol.

Axes 3 and 4

No variable is well represented on axis 3, though we can still see a negative correlation between alcohol and volatile.acidity. Axis 4 is driven by chlorides and sulphates, which are positively correlated.

Individual analysis

Concentration ellipse

Good quality wines tend to have a lower sulphate level than bad quality wines. However, acidity does not seem to impact wine quality.

Good quality wines have a higher alcohol percentage and a lower volatile acidity than lower quality wines. Some individuals stand out: 152, 1436, 1477.

These points being well represented on the plane, let's take a closer look at them to understand why they stand out.

We can see that the residual.sugar (except for individual 152) and free.sulfur.dioxide values are high for these 3 individuals, well above the median.

Logistic regression

The variable qualiteY is a binary variable giving the wine quality: 0 if the wine is of bad quality and 1 if it is a good quality wine. We are going to use this variable in a logistic regression to build a model able to explain wine quality from its characteristics.

Explanatory model

We are trying to understand which variables best explain wine quality.

The step function selects a model through a stepwise procedure based on minimising the AIC criterion. It allows me to keep only the variables relevant to my model and to remove the variables that do not contribute to it or only add noise.

The model keeps these variables: fixed.acidity, volatile.acidity, citric.acid, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, sulphates, alcohol.
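The selection is done here with R's step function. As a rough, hedged sketch of the same idea in Python, backward elimination can be coded by hand, computing the AIC from an (almost) unpenalised logistic fit; the data and column names below are simulated stand-ins, not the wine variables:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(2)
X = rng.normal(size=(1599, 4))
# Only the first two columns actually drive the simulated outcome
y = (X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=1599) > 0).astype(int)
features = ["alcohol", "volatile.acidity", "noise1", "noise2"]

def aic(cols):
    # AIC = 2k - 2 logL, with k = number of coefficients (incl. intercept)
    idx = [features.index(c) for c in cols]
    model = LogisticRegression(C=1e10, max_iter=1000).fit(X[:, idx], y)
    log_lik = -log_loss(y, model.predict_proba(X[:, idx])[:, 1], normalize=False)
    return 2 * (len(cols) + 1) - 2 * log_lik

kept = list(features)
improved = True
while improved and len(kept) > 1:
    improved = False
    current = aic(kept)
    for c in list(kept):
        candidate = [f for f in kept if f != c]
        if aic(candidate) < current:  # dropping c lowers the AIC: remove it
            kept, improved = candidate, True
            break
print(kept)
```

R's step additionally considers re-adding variables in "both" mode; this sketch only removes them.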

## 
## Call:
## glm(formula = qualiteY ~ fixed.acidity + volatile.acidity + citric.acid + 
##     chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     sulphates + alcohol, family = "binomial", data = vin)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3340  -0.8488   0.3242   0.8294   2.3493  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -9.216919   0.949966  -9.702  < 2e-16 ***
## fixed.acidity         0.127271   0.051081   2.492  0.01272 *  
## volatile.acidity     -3.379881   0.477983  -7.071 1.54e-12 ***
## citric.acid          -1.260357   0.560972  -2.247  0.02466 *  
## chlorides            -3.529121   1.509122  -2.339  0.01936 *  
## free.sulfur.dioxide   0.022082   0.008184   2.698  0.00697 ** 
## total.sulfur.dioxide -0.015645   0.002811  -5.565 2.62e-08 ***
## sulphates             2.686254   0.432624   6.209 5.32e-10 ***
## alcohol               0.905412   0.073423  12.331  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2209.0  on 1598  degrees of freedom
## Residual deviance: 1657.8  on 1590  degrees of freedom
## AIC: 1675.8
## 
## Number of Fisher Scoring iterations: 4

Analysis

Alcohol is the most significant variable for predicting wine quality; it is the variable that brings the most information. As alcohol increases, the probability of the wine being good quality increases (positive estimate). Conversely, as volatile acidity increases, the probability of the wine being good quality decreases (negative estimate).
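The estimates above are on the log-odds scale; exponentiating them turns them into odds ratios, which can be checked directly from the printed glm coefficients:

```python
import math

# Estimates taken from the glm output above
coef_alcohol = 0.905412
coef_volatile_acidity = -3.379881

# Each extra unit of alcohol multiplies the odds of a good wine by about 2.47
or_alcohol = math.exp(coef_alcohol)
# Each extra unit of volatile.acidity divides those odds by about 29
or_volatile_acidity = math.exp(coef_volatile_acidity)
print(round(or_alcohol, 2), round(or_volatile_acidity, 3))
```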

Predictive model

If we need to predict wine quality on new data, we build a predictive model on a training sample and test it on a test dataset to estimate our error rate.

Training and test sample creation

The sample contains enough data, so we can split it into two samples, one for training and one for testing.

I check the proportion of qualiteY in my two samples:

Train

Test

The proportions are equivalent.
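The split here was done in R; as an illustration, scikit-learn's stratify option achieves the same balanced proportions (simulated data standing in for the wine table):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(1599, 11))
y = rng.integers(0, 2, size=1599)  # stand-in for qualiteY

# stratify=y keeps the 0/1 proportions (almost) identical in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print(round(y_train.mean(), 2), round(y_test.mean(), 2))
```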

Model quality

ROC curve

The ROC curve (Receiver Operating Characteristic curve) plots the true positive rate on the y axis against the false positive rate on the x axis.

AUC

The AUC (Area Under the Curve) summarises the ROC curve into a single number measuring how well the model separates the two classes on the test sample. It makes it possible to compare the ROC curves of multiple models.

AUC of the predictive model : 0.82
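As a small illustration of what the AUC measures (toy scores, not the wine model's actual predictions):

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# AUC: probability that a random positive is ranked above a random negative
auc = roc_auc_score(y_true, scores)
fpr, tpr, thresholds = roc_curve(y_true, scores)
print(auc)  # 0.75: one of the four positive/negative pairs is mis-ranked
```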

Confusion matrix with a threshold of 0.5

If I get new data, I can expect an error rate of 24.14% with this predictive model, using a threshold of 0.5.

Optimum threshold

We are now going to look for the threshold that minimizes this error rate.

##   threshold specificity sensitivity
## 1 0.5139985   0.7837838    0.748538

Let’s fix the threshold to 0.51 to create the confusion matrix.
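The coordinates above come from the R analysis; as a hedged Python sketch, a threshold maximising Youden's J (sensitivity + specificity - 1) can be read off the ROC curve, here on simulated scores:

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(4)
y = rng.integers(0, 2, size=500)
# Simulated scores: informative about y but noisy
scores = y * 0.4 + rng.uniform(size=500) * 0.8

fpr, tpr, thresholds = roc_curve(y, scores)
# Youden's J = sensitivity + specificity - 1 = tpr - fpr
best = int(np.argmax(tpr - fpr))
best_threshold = thresholds[best]
print(best_threshold, tpr[best], 1 - fpr[best])
```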

Confusion matrix with a threshold of 0.51
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0451  0.3105  0.5142  0.5377  0.8057  0.9859

At the threshold of 0.51, the error rate is 23.51%. If I get new data, I can expect an error rate of 23.51% with this predictive model. When we receive new data, we will need to recalibrate the model.

Residual analysis

A good residual is a residual without any apparent structure. Residuals need to be independent of the observations.

We will need to refit the model without observation 653.

In theory, 95% of the Studentized residuals lie within the interval [-2, 2]. That is the case here, as only 30 residuals (2.34%) fall outside the interval.

Comparison with other models

I will compare with other models to see whether they give better results than the logistic regression model.

Interaction model

The AUC with this model is 0.82.

Model without correlated variables

The AUC with this model is 0.82.

Decision Tree

The AUC with this model is 0.77.

Random forest

The AUC with this model is 0.92.

Best model and confusion matrix

Random forest is the model that allows for the best results (highest AUC).

Confusion matrix for the random forest with a threshold of 0.5:

With a random forest, I can expect an error rate of 15.36% on new data.
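To close, here is a minimal hedged sketch of the comparison on simulated data: fit a random forest and compute its AUC on a held-out test set with scikit-learn (the features are stand-ins, not the real wine variables):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
X = rng.normal(size=(1599, 11))
# Simulated outcome driven non-linearly by two columns plus a linear term
y = ((X[:, 0] * X[:, 1] + 0.5 * X[:, 2]
      + rng.normal(scale=0.5, size=1599)) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)
auc_rf = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])
print(round(auc_rf, 2))
```

A random forest captures the non-linear interaction here that a logistic regression would miss, which mirrors why it wins the AUC comparison above.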